Discretization Numerical Data for Relational Data with One-to-Many Relations

Author

  • Rayner Alfred
Abstract

Problem statement: Handling numerical data stored in a relational database differs from handling numerical data stored in a single table, because an individual record can occur multiple times in the non-target table (a one-to-many association) and because relations between tables can be non-determinate. In Multi-Relational Data Mining (MRDM), numbers were often discretized after considering the schema of the relational database. This study examines the effect of taking the one-to-many association into consideration when discretizing continuous numbers.
Approach: Several alternatives for dealing with continuous attributes in MRDM were considered, namely Equal-Width (EWD), Equal-Height (EH), Equal-Weight (EWG) and Entropy-Based (EB) discretization. The procedures considered included algorithms that do not depend on the multi-relational structure of the data as well as algorithms that are sensitive to this structure. A new discretization method, called Entropy-Instance-Based (EIB) discretization, was implemented and evaluated with respect to C4.5 on two well-known multi-relational databases: the Mutagenesis dataset and the Hepatitis dataset from the PKDD 2005 Discovery Challenge.
Results: When the number of bins, b, was large (b = 8), the EIB discretization method produced better data summarization results than the other discretization methods on the Mutagenesis dataset. In contrast, on the Hepatitis dataset, the EIB discretization method produced better data summarization results than the other methods for all values of b. On the Hepatitis dataset, all discretization methods also produced higher average accuracy (%) with the partitional clustering technique than with the hierarchical technique.
Conclusion: These results demonstrate that entropy-based discretization can be improved by taking the multiple-instance problem into consideration. The partitional clustering technique was also found to produce better accuracy than the hierarchical clustering technique.
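
To make the binning schemes named in the abstract concrete, the sketch below shows generic textbook versions of equal-width, equal-height (equal-frequency) and single-cut entropy-based discretization. It is not the paper's implementation and does not reproduce the EIB method, whose details are not given here. The table `rows`, the mapping `target_class` and all function names are hypothetical, and copying each target record's class onto its non-target rows is only one simple assumption for obtaining labels under a one-to-many association.

```python
import math
from collections import Counter


def equal_width_edges(values, b):
    """Return b-1 cut points that split [min, max] into b equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / b
    return [lo + i * width for i in range(1, b)]


def equal_height_edges(values, b):
    """Return b-1 cut points so each bin holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // b] for i in range(1, b)]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_entropy_cut(values, labels):
    """Return the single cut point that minimises the weighted class entropy."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_cut, best_score = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between identical values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if score < best_score:
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_cut


def to_bin(value, edges):
    """Map a value to the index of its bin, given sorted cut points."""
    return sum(value >= e for e in edges)


# Hypothetical non-target table: (record_id, numeric_value), several rows per record.
rows = [(1, 0.5), (1, 1.0), (1, 9.0), (2, 2.0), (2, 2.5), (3, 8.0), (3, 10.0)]
# Hypothetical class label of each target record.
target_class = {1: "pos", 2: "neg", 3: "pos"}

values = [v for _, v in rows]
# Assumption: every non-target row inherits the class of its target record.
labels = [target_class[rid] for rid, _ in rows]

ew = equal_width_edges(values, 4)
eh = equal_height_edges(values, 4)
cut = best_entropy_cut(values, labels)
```

In this sketch every row of the non-target table counts equally, so a record with many rows influences the cut points more than a record with few. That imbalance is the kind of one-to-many effect the abstract says a structure-sensitive method should account for.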

Similar resources

Discretization Numbers for Multiple-Instances Problem in Relational Database

Handling numerical data stored in a relational database is different from handling numerical data stored in a single table due to the multiple occurrences of an individual record in the non-target table and non-determinate relations between tables. Most traditional data mining methods only deal with a single table and discretize columns that contain continuous numbers into nominal...

Relational Semantics for Databases and Predicate Calculus

The relational data model requires a theory of relations in which tuples are not only many-sorted, but can also have indexes that are not necessarily numerical. In this paper we develop such a theory and define operations on relations that are adequate for database use. The operations are similar to those of Codd’s relational algebra, but differ in being based on a mathematically adequate theor...

Compiling Parallel Sparse Code for User-Defined Data

We describe how various sparse matrix and distribution formats can be handled using the relational approach to sparse matrix code compilation. This approach allows for the development of compilation techniques that are independent of the storage formats by viewing the data structures as relations and abstracting the implementation details as access methods. Sparse matrix computations...

Discovery of spatial association rules in geo-referenced census data: A relational mining approach

Census data mining has great potential both in business development and in good public policy, but a number of research issues in this field must still be solved. In this paper, problems related to the geo-referencing of census data are considered. In particular, the accommodation of the spatial dimension in census data mining is investigated for the task of discovering spatial association r...

Mining Frequent Patterns in Uncertain and Relational Data Streams using the Landmark Windows

Today, in many modern applications, we search for frequent and repeating patterns in the analyzed data sets. In this search, we look for patterns that appear frequently in the data set and mark them as frequent patterns, so that users can make decisions based on these discoveries. Most algorithms presented in the context of data stream mining and frequent pattern detection work either on uncertai...

Publication date: 2009